Zhang et al. BMC Bioinformatics 2013, 14(Suppl 8):S2
http://www.biomedcentral.com/1471-2105/14/S8/S2



Bioinformatics 



PROCEEDINGS Open Access 



Locating tandem repeats in weighted sequences 
in proteins 

Hui Zhang, Qing Guo, Costas S Iliopoulos

From The 2012 International Conference on Intelligent Computing (ICIC 2012) 
Huangshan, China. 25-29 July 2012 



Abstract 

A weighted biological sequence is a string in which a set of characters may appear at each position with 
respective probabilities of occurrence. We attempt to locate all the tandem repeats in a weighted sequence. 
A repeated substring is called a tandem repeat if its occurrences are directly adjacent to one
another. By introducing the idea of equivalence classes in weighted sequences, we identify the tandem repeats of
every possible length using an iterative partitioning technique. We also present the algorithm for recording the
tandem repeats, and prove that the problem can be solved in $O(n^2)$ time.



Introduction 

A weighted biological sequence, called for short a 
weighted sequence, is a special string that allows a set of 
characters to occur at each position of the sequence 
with respective probability, instead of a fixed single 
character occurring in a normal string. It can be viewed 
as a compressed version of multiple alignment which 
shows strength in extracting and representing the con- 
served commonalities of a set of sequences. 

Weighted sequences are well suited to summarizing poorly
defined short sequences, e.g. transcription factor binding
sites, the profiles of protein families and complete
chromosome sequences [1]. With this model, one can attempt
to locate motifs of biological importance, to estimate
the binding energy of proteins, and even to infer
evolutionary homology. Designing powerful algorithms on
weighted sequences in proteins is therefore of both
theoretical and practical significance.

This paper concentrates on locating tandem repeats
in a weighted sequence. Tandem repeats occur in
a string when a substring is repeated two or more
times and each repetition is directly adjacent to the
next. For example, the substring ATT occurs three
times in the string X = CATTATTATTG, and the
occurrences of ATT are consecutive, one after the other.
Then ATT is a tandem repeat of length 3 of X.
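
The example above can be checked mechanically. The following is a minimal Python sketch for ordinary (non-weighted) strings only; the function name `tandem_run` and its 0-based indexing are assumptions made for illustration and are not part of the paper.

```python
def tandem_run(x, start, period):
    """Count the adjacent copies of x[start:start+period] that begin at `start`.

    `start` is 0-based. Returns 0 when no full unit fits at `start`.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    unit = x[start:start + period]
    if len(unit) < period:
        return 0
    count = 0
    pos = start
    # Step forward one period at a time while the same unit keeps appearing.
    while x[pos:pos + period] == unit:
        count += 1
        pos += period
    return count


x = "CATTATTATTG"
copies = tandem_run(x, 1, 3)  # unit "ATT" starting at the second character
print(copies)
```

A count of two or more means the unit forms a tandem repeat at that position.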

The motivation for investigating tandem repeats in 
weighted sequences comes from the striking feature of 
DNA that vast quantities of tandemly repetitive seg- 
ments occur in the genome, with high proportion of 
more than 50 percent in fact [2]. Some examples are 
microsatellite, minisatellite, and satellite DNA. 

It should be noticed that tandem repeats are not redun- 
dant information, but of either functional or evolutionary 
significance [3]. For instance, tandem repeats frequently 
occur within or in the proximity of genes, i.e., either in the 
untranslated regions up and downstream of open reading 
frames, within introns, or in coding regions [4]. Recent
evidence suggests that tandem repeats in these regions
can play a significant role in regulating gene expression
and modulating gene function [5]. Thus it is of great
biological interest to locate tandem repeats in biological DNA
sequences and proteins.

Identifying special areas of a biological sequence by
their structure has been a long-standing effort. A large
amount of work has been done to find all tandem repeats
in non-weighted strings. Technically, these solutions can be
divided into two main categories. One employs traditional
string comparison and searching methods, where the most
famous algorithms are Crochemore's partitioning [6] and
LZ decomposition [7], each with time complexity $O(n \log n)$.
The other computes tandem repeats by constructing a
suffix tree or a suffix array. Although they need
extra memory, these algorithms can also reach $O(n \log n)$
time by limiting the number of outputs [8-12].

However, relatively little work has addressed
weighted sequences. Iliopoulos et al. [13,14]
were the first to touch this field, extracting repeats and
other types of repetitive motifs in weighted sequences by
constructing a weighted suffix tree. The weighted suffix tree is
built like a suffix tree, with the distinction that the
weight of each substring must be considered. This
directly leads to a large weighted suffix tree whose size
depends strongly on the presence probability. Another
solution [15,16] used the partitioning technique based on the
KMR algorithm to find tandem repeats of length d in
$O(n \log d)$ time, but did not give an efficient algorithm for
computing the tandem repeats of all lengths.

On the other hand, many recent studies on
identifying hot spots in proteins inspired us. Huang
et al. [17] first utilized a support vector machine (SVM)
classifier based upon hydropathy blocks to classify
protein sequences. Then Xia et al. [18] used a support vector
machine (SVM) to predict hot spot residues in protein
interfaces. Selecting nine individual features from 62 
features, they developed a new ensemble classifier APIS to 
further improve the prediction accuracy. You et al. [19] 
developed a robust manifold embedding technique 
for assessing the reliability of interactions and predicting 
new interactions, which was reinterpreted into the problem 
of measuring similarity between points of its metric space 
after transforming a given PPI network into a low dimen- 
sional metric space using manifold embedding based on 
isometric feature mapping. Zheng et al. [20] employed 
independent component analysis for gene selection, 
then introduced gene selection and explicitly enforcing 
sparseness into nonnegative matrix factorization for tumor 
clustering. Wang et al. [21] proposed a novel tumor classifi- 
cation method based on correlation filters other than the 
model to identify the overall pattern of tumor subtype hid- 
den in genes. 

The paper focuses on finding tandem repeats of all
lengths in a given weighted sequence in proteins. The
paper is organized as follows. In the next section we give
the necessary theoretical preliminaries, then
introduce the all-tandem-repeats problem and explain why
Crochemore's partitioning algorithm cannot be adapted
to weighted sequences. After that, we present our
algorithm for computing all the tandem repeats in weighted
sequences, and give experimental results to verify the
algorithm's performance. Finally we conclude and discuss
our research interest.

Preliminaries

A biological sequence used throughout the paper is a
string either over the 4-character DNA alphabet
$\Sigma = \{A, C, G, T\}$ of nucleotides or the 20-character alphabet of
amino acids. Assuming that readers have essential
knowledge of the basic concepts of strings, we now extend
some of them to weighted sequences. Formally speaking:
Definition 1 Let an alphabet be $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_l\}$.
A weighted sequence X over $\Sigma$, denoted by $X[1, n] = X[1]X[2] \ldots X[n]$,
is a sequence of n sets $X[i]$ for $1 \le i \le n$, such that:

$X[i] = \{ (\sigma_j, \pi_i(\sigma_j)) \mid 1 \le j \le l,\ \pi_i(\sigma_j) \ge 0,\ \text{and}\ \sum_{j} \pi_i(\sigma_j) = 1 \}$

Each $X[i]$ is a set of couples $(\sigma_j, \pi_i(\sigma_j))$, where $\pi_i(\sigma_j)$ is
the non-negative weight of $\sigma_j$ at position i, representing
the probability of having character $\sigma_j$ at position i of X.

Let X be a weighted sequence of length n, and $\sigma$ be a
character in $\Sigma$. We say that $\sigma$ occurs at position i of X if and only
if $\pi_i(\sigma) > 0$, written as $\sigma \in X[i]$. A nonempty non-weighted
string $f[1, m]$ ($m \in [1, n]$) occurs at position i of X if and
only if position $i + j - 1$ is an occurrence of the character
$f[j]$ in X, for all $1 \le j \le m$. Then f is said to be a factor of X,
and i is an occurrence of f in X.

The probability of the presence of f at position i of X
is called the weight of f at i, written as $\pi_i(f)$, which can
be obtained by using different weight measures. We
exploit the one in common use, called the cumulative
weight, defined as the product of the weights of the
characters at every position of f: $\pi_i(f) = \prod_{j=1}^{m} \pi_{i+j-1}(f[j])$.

Consider the following weighted sequence of
length 5:

      (A, 0.5)          (A, 0.6)     (A, 0.25)
X:    (C, 0.25)    G    (C, 0.4)     (C, 0.25)    C        (1)
      (G, 0.25)                      (G, 0.25)
                                     (T, 0.25)


The weight of f = GAT at position 2 of X is: $\pi_2(f) = 1 \times
0.6 \times 0.25 = 0.15$. That is, GAT occurs at position 2 of X
with probability 0.15. Note that for clarity, we employ a
simplified vertical representation of a weighted
sequence, in which a character with probability 1 is
written alone, without its probability.

A factor f of a weighted sequence X is called a repeat in
X if there exist at least two distinct positions of X that are
occurrences of f in X. As a special case of repeats, tandem
repeats can be formally defined as follows.

Definition 2 A factor f of length p of a weighted
sequence X is called a tandem repeat in X if there exists a
triple $(i, f, l)$ such that for each $0 \le j \le l - 1$, position $i + jp$
is an occurrence of the factor f in X.

It is easy to see that the difficulty of locating the
tandem repeats in weighted sequences arises from the
uncertainty of weighted sequences. Firstly, different characters
might occur at the same position, which yields multiple
factors of equal length at each position of the weighted
sequence. Secondly, as each character occurs at a
position with its own probability, the corresponding factors
also have different presence probabilities, so
the weight of each appearance of a factor f can differ
considerably.

As scientists pay more attention to the pieces with 
high probabilities in DNA sequences, we fix a constant 
threshold for the presence probability of the motif, that 
is, only those occurrences with probability not less than 
this threshold are counted. 

Definition 3 Let f be a factor of length d of a
weighted sequence X that occurs at position i, and let
$k \ge 1$ be a real constant threshold. We say that f is a real factor
of X if and only if the weight (probability) of f at i, $\pi_i(f)$,
is at least $1/k$. Exactly, $\prod_{j=1}^{d} \pi_{i+j-1}(f[j]) \ge 1/k$.

In the above example (1), set $1/k = 0.3$; then AGA is a
real factor of X that occurs at position 1 since $\pi_1(AGA) =
0.5 \times 1 \times 0.6 = 0.3 \ge 0.3$, while CAC is not a real factor of
X at position 3 since $\pi_3(CAC) = 0.1 < 0.3$.
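
The cumulative weight and the real-factor test can be sketched in a few lines of Python. The representation below (one dictionary per position) and the names `weight` and `is_real_factor` are assumptions chosen for illustration; the small tolerance `eps` is likewise an assumption, added only to keep floating-point rounding from rejecting a weight that equals the threshold.

```python
from math import prod

# Example (1): one dict per position, mapping character -> probability.
X = [
    {"A": 0.5, "C": 0.25, "G": 0.25},
    {"G": 1.0},
    {"A": 0.6, "C": 0.4},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"C": 1.0},
]


def weight(X, f, i):
    """Cumulative weight of the non-empty string f at 1-based position i of X.

    Returns 0.0 when f does not fit inside X or does not occur at i.
    """
    if i < 1 or i + len(f) - 1 > len(X):
        return 0.0
    return prod(X[i - 1 + j].get(c, 0.0) for j, c in enumerate(f))


def is_real_factor(X, f, i, threshold, eps=1e-12):
    """True when the weight of f at position i is at least threshold (= 1/k)."""
    return weight(X, f, i) >= threshold - eps


print(weight(X, "GAT", 2))
print(is_real_factor(X, "AGA", 1, 0.3), is_real_factor(X, "CAC", 3, 0.3))
```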

The all-tandem-repeats problem 

Now we introduce the all-tandem-repeats problem in 
weighted sequences as below: 

Problem 1 Given a weighted sequence $X[1, n]$ and a
real constant $k \ge 1$, the all-tandem-repeats problem
identifies the set S of all triples $(i, f, l)$, where $1 \le |f| \le
n/2$ and f is a real factor of X.

Our algorithm for picking all the tandem repeats is 
based on the following idea of equivalence relation on 
positions of a string: 

Definition 4 Given a string x of length n and an
integer $p \in \{1, 2, \ldots, n\}$, let S be the set of positions of x:
$\{1, 2, \ldots, n - p + 1\}$. Then $E_p$ is defined to be an equivalence relation
on S such that: for two positions $i, j \in S$, $(i, j) \in E_p$ if
$x[i, i + p - 1] = x[j, j + p - 1]$.

In the following context, a nonempty substring of x of
length p is called a p-substring of x. Clearly, two positions
i and j of x are said to be p-equivalent when the two
p-substrings starting at i and j in x are identical. Although this
definition is stated for non-weighted strings, it can also
be extended to weighted sequences. Before presenting our
algorithm, we first introduce Crochemore's partitioning
algorithm [6] for computing tandem repeats in
non-weighted sequences. The algorithm employs the following
ideas of equivalence class and partition.

Definition 5 Consider the substring $w = x[i, i + p - 1]$
for $i \in S$. The set of all positions of x that are related to
i, i.e., $\{j \mid (i, j) \in E_p,\ j \in S\}$, is called the equivalence class of i,
or alternatively, the equivalence class associated with w,
denoted by $C_w(p)$.

Definition 6 Let $S_1, S_2, \ldots, S_r$ be nonempty subsets
of S. We say that $\{S_1, S_2, \ldots, S_r\}$ is a partition of S if:

(i) $S = S_1 \cup S_2 \cup \ldots \cup S_r$

(ii) $S_i \cap S_j = \emptyset$ for $1 \le i, j \le r$ and $i \ne j$.

For an equivalence relation $E_p$ on a set S, all the
equivalence classes of $E_p$, called $E_p$-classes, compose a
partition of S, since every element of S falls into exactly
one $E_p$-class. We also say that S is partitioned into a
family of $E_p$-classes. In this sense, partitions and
equivalence relations are the same.

It is obvious that each $E_p$-class of cardinality not less than
two records the occurrences of a repeated p-substring of x.
Hence, the problem of computing all the repeated
p-substrings of x can be rephrased as finding the partition of $E_p$.
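
To make the correspondence between $E_p$-classes and repeated p-substrings concrete, here is a small Python sketch for ordinary strings. It groups positions by hashing every p-substring directly, which is a naive construction and not Crochemore's refinement; the function names are assumptions introduced for illustration.

```python
from collections import defaultdict


def equivalence_classes(x, p):
    """Map each p-substring of x to the sorted list of its 1-based start positions.

    Each value is one E_p-class; together they partition {1, ..., n - p + 1}.
    """
    classes = defaultdict(list)
    for i in range(len(x) - p + 1):
        classes[x[i:i + p]].append(i + 1)
    return dict(classes)


def repeated_substrings(x, p):
    """Keep only the E_p-classes with at least two positions."""
    return {w: pos for w, pos in equivalence_classes(x, p).items() if len(pos) >= 2}


repeats = repeated_substrings("CATTATTATTG", 3)
print(repeats)
```

In this example the class of ATT holds positions that are exactly 3 apart, which is what distinguishes a tandem repeat from an arbitrary repeat.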

Observe that $E_{p+1}$ is a refinement of $E_p$ by excluding the
position $n - p + 1$. Thus the equivalence relations can be
iteratively constructed by starting with $E_1$, then
successively building $E_2$, etc., until $E_L$ such that each $E_L$-class
is a singleton, that is, a set that consists of only one
element. Crochemore efficiently executed this iterative
computation and located all the tandem repeats in x in
$O(n \log n)$ time by introducing the following ideas:

- Small classes: Consider the refinement from $E_p$ to $E_{p+1}$.
Assume that an $E_p$-class C is partitioned into r $E_{p+1}$-classes;
we call the one of maximal size a big class of C, and the
other $r - 1$ are small classes.

- Smaller-half trick: The trick depends on the following
Lemma:

Lemma 1 Let x be a string of length n, $p \in \{1, 2, \ldots, n\}$,
$i, j \in \{1, 2, \ldots, n - p\}$. Then:

$(i, j) \in E_{p+1}$ iff $(i, j) \in E_p$ and $(i + 1, j + 1) \in E_p$

Therefore, instead of partitioning all $E_p$-classes at stage
p, the algorithm simply examines each small $E_p$-class SC
and partitions those related classes RC such that $\{RC \mid i \in
RC$ and $i + 1 \in SC\}$. Simply speaking, for any $E_p$-class C,
only the positions that will be transferred into small
$E_{p+1}$-classes are assigned new indexes, while the big $E_{p+1}$-class
directly inherits the index of C.

The running time of this algorithm is proportional to
the total size of the small classes. By definition, all the
$E_1$-classes are small, with cardinality less than n. As each
small $E_{p+1}$-class has size not greater than half of the
cardinality of its corresponding $E_p$-class, a position
cannot belong to a small class more than $\log n$ times.
Therefore, the partitioning algorithm takes $O(n \log n)$ time for a
string of length n.
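
The halving argument behind this bound can be written out explicitly. As a sketch, let $s_t$ denote the size of the class that contains a fixed position after that position has been moved into a small class for the t-th time (the symbol $s_t$ is introduced here for illustration only). Since a small class is at most half the size of the class it was split from,

$$
s_t \le \frac{s_{t-1}}{2} \le \frac{n}{2^{t}}, \qquad s_t \ge 1 \;\Longrightarrow\; t \le \log_2 n .
$$

Summing over the n positions gives at most $n \log_2 n$ reassignments in total, which matches the stated $O(n \log n)$ running time.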

Although proved to be optimal, this algorithm cannot
be applied to a weighted sequence X for the following
reasons:

1. Multiple distinct characters may occur at the same
position of a weighted sequence. In this case, a
position may go to more than one equivalence class,
associated with different substrings of the same length,
thus the smaller-half trick makes no sense.
2. For weighted sequences, the presence
probability of any factor cannot be ignored, as it is
restricted by the probability threshold.

Our algorithm 

As stated above, Crochemore's algorithm cannot be
directly used on weighted sequences, but it inspires us
to borrow the idea of partitioning. By improving the
method for computing repeated patterns in weighted
sequences that we proposed in [22], we first mirror the
definition of $E_p$-classes of non-weighted strings, and
give the corresponding weighted version:

Definition 7 Consider a factor f of length p in a
weighted sequence $X[1, n]$. An $E_p$-class associated with f
is the set $C_f(p)$ of all position-probability pairs, denoted
by $(i, \pi_i(f))$, such that f occurs at position i with
probability $\pi_i(f) \ge 1/k$.

$C_f(p)$ is an ordered list that contains all the positions
of X where f occurs. Note that only the occurrences of
real factors are considered. For this reason, the
probability of each appearance of a factor should be
recorded and kept for the next iteration.

Although tandem repeats are special cases of repeats 
in weighted sequences, the following facts draw a dis- 
tinction between the algorithms for computing tandem 
repeats and the repeats we proposed before. 

Fact 1 The occurrences of a tandem repeat are not
overlapping.

Fact 2 If a factor f is a tandem repeat of X, any
consecutive alignment of f should not be reported as a tandem
repeat again.

For instance, the string ATATATAT will report the
tandem repeat (1, AT, 4), not (1, ATAT, 2). According to
the above facts, tandem repeats can be filtered promptly
during the construction of equivalence classes.

Note that in this construction process, a position i is
allowed to go to several but no more than $|\Sigma|$ different
$E_p$-classes, due to the uncertainty of weighted sequences.
Nevertheless, we continue to use the notion "partition" to describe
the process of building $E_p$-classes from $E_{p-1}$-classes, which
can be computed based upon the following corollary:

Corollary 1 Let $p \in \{1, 2, \ldots, n\}$, $i, j \in \{1, 2, \ldots, n - p\}$.
Then:

$((i, \pi_i(f)), (j, \pi_j(f))) \in C_f(p)$ iff $((i, \pi_i(f')), (j, \pi_j(f'))) \in C_{f'}(p - 1)$
and $((i + p - 1, \pi_{i+p-1}(\sigma)), (j + p - 1, \pi_{j+p-1}(\sigma))) \in C_\sigma(1)$

where $\sigma \in \Sigma$, f and f' are two factors of length p and $p - 1$
respectively, such that $f = f'\sigma$ and $\pi_i(f) \ge 1/k$, $\pi_j(f) \ge 1/k$.
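
A single refinement step in the spirit of Corollary 1 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the weighted sequence is stored as one dictionary per position, a class is a list of (position, weight) pairs with 1-based positions, and the name `refine` and the tolerance `eps` are hypothetical.

```python
# Example (1) again: one dict per position, mapping character -> probability.
X = [
    {"A": 0.5, "C": 0.25, "G": 0.25},
    {"G": 1.0},
    {"A": 0.6, "C": 0.4},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"C": 1.0},
]


def refine(X, prev_class, f_prev, threshold, eps=1e-12):
    """Build the classes C_f(p) obtainable from one class C_{f'}(p-1).

    prev_class: list of (i, weight of f' at i), 1-based i.
    Returns a dict {f: [(i, weight of f at i), ...]} with f = f' + one character,
    keeping only occurrences whose weight is at least threshold (= 1/k).
    """
    p = len(f_prev) + 1
    out = {}
    for i, w in prev_class:
        end = i + p - 1              # 1-based position of the appended character
        if end > len(X):
            continue                 # f would run past the end of X
        for c, pc in X[end - 1].items():
            w_new = w * pc           # cumulative weight is a running product
            if w_new >= threshold - eps:
                out.setdefault(f_prev + c, []).append((i, w_new))
    return out


# C_A(1) for threshold 0.3: A occurs with weight >= 0.3 at positions 1 and 3.
classes_2 = refine(X, [(1, 0.5), (3, 0.6)], "A", 0.3)
print(classes_2)
```

Carrying the weight along with each position is what lets the next stage extend a factor with one multiplication instead of recomputing the whole product.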

Our algorithm for picking all the tandem repeats of X 
then operates as follows: 

1. "Partition" all the n positions of X to build $E_1$ and
detect all the tandem repeats of length 1: For every
character $\sigma \in \Sigma$, create a class $C_\sigma(1)$ that is an ordered
list of couples $(i, \pi_i(\sigma))$, where i is an occurrence of $\sigma$ in
X with probability not less than $1/k$. The classes
composed of more than one element form $E_1$. Those $C_\sigma(1)$s
in which the distance between two or more adjacent
positions i is 1 report the tandem repeats of length 1.

2. Iteratively compute $E_p$-classes from $E_{p-1}$-classes using
the above corollary for $p \ge 2$, and find all the tandem
repeats of length p: Take each class $C(p - 1)$ of $E_{p-1}$,
partition $C(p - 1)$ so that any two positions $i, j \in C(p - 1)$ go
to the same $E_p$-class if positions $i + p - 1$, $j + p - 1$
belong to the same $E_1$-class, and this $E_p$-class represents a
real factor of X.

3. For each $E_p$-class $C(p)$ partitioned from $C(p - 1)$, test if
the factor associated with $C(p)$ is a tandem repeat of X: If
the cardinality of $C(p)$ is at least two and the distance
between two or more adjacent positions in $C(p)$ equals p,
add the corresponding triple into the tandem repeat set S.
Eliminate those $C(p)$s that are singletons, and keep the
rest to proceed with the iterative computation at stage $p + 1$.

4. The computation stops at stage L, once no new $E_{L+1}$-classes
can be created or each $E_L$-class is a singleton.
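
The four steps above can be sketched end to end in Python. This is a minimal sketch under stated assumptions rather than the authors' C++ implementation: the weighted sequence is a list of dictionaries, every copy of f in a reported run must individually have weight at least the threshold, only maximal runs are reported, and Fact 2 is enforced by skipping factors that are powers of a shorter string. All function names and the tolerance `eps` are hypothetical, and the sketch is not tuned to meet the stated time bound.

```python
def is_primitive(f):
    """True when f is not a power of a shorter string (used for Fact 2)."""
    return f not in (f + f)[1:-1]


def report_runs(f, positions, result):
    """Add (i, f, l) for each maximal run of l >= 2 copies of f spaced len(f) apart."""
    p = len(f)
    pos = set(positions)
    for i in sorted(pos):
        if i - p in pos:
            continue                 # i lies inside a run that started earlier
        l = 1
        while i + l * p in pos:
            l += 1
        if l >= 2:
            result.add((i, f, l))


def all_tandem_repeats(X, threshold, eps=1e-12):
    """Return the set of triples (i, f, l) with 1-based start i and l >= 2 copies."""
    n = len(X)
    # Step 1: build the classes of single characters that pass the threshold.
    classes = {}
    for i, column in enumerate(X, start=1):
        for c, pc in column.items():
            if pc >= threshold - eps:
                classes.setdefault(c, []).append((i, pc))
    result = set()
    p = 1
    # Steps 2-4: report runs at stage p, then extend every surviving class.
    while classes and p <= n // 2:
        nxt = {}
        for f, occ in classes.items():
            if len(occ) < 2:
                continue             # singletons cannot yield repeats later on
            if is_primitive(f):
                report_runs(f, [i for i, _ in occ], result)
            for i, w in occ:
                end = i + p          # 1-based position of the character appended to f
                if end > n:
                    continue
                for c, pc in X[end - 1].items():
                    w_new = w * pc
                    if w_new >= threshold - eps:
                        nxt.setdefault(f + c, []).append((i, w_new))
        classes = nxt
        p += 1
    return result


plain = [{c: 1.0} for c in "ATATATAT"]
print(sorted(all_tandem_repeats(plain, 0.5)))
```

On the plain string ATATATAT the sketch reports (1, AT, 4) and never (1, ATAT, 2); it also reports the rotated unit TA starting at position 2, which Definition 2 permits.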

Algorithm 1 Compute all the tandem repeats of a
weighted sequence
Input: a weighted sequence $X[1, n]$, a real constant $k \ge 1$
Output: all the tandem repeats of X
Algorithm Compute-Tandem-Repeats(X, k)
    for $i \leftarrow 1$ to n do
        for each $\sigma_j \in X[i]$ do
            $l \leftarrow 0$
            while $\pi_{i+l}(\sigma_j) \ge 1/k$ do
                add $(i + l, \pi_{i+l}(\sigma_j))$ to $C_{\sigma_j}(1)$
                $l \leftarrow l + 1$
            if $l > 1$ then
                $S \leftarrow S \cup (i, \sigma_j, l)$
    $p \leftarrow 2$
    while $p \le n/2$ and $E_{p-1} \ne \emptyset$ do
        for each pair $(C_f(p - 1), f)$ in the $E_{p-1}$ list do
            SUB $\leftarrow$ Create-Equiv-Class($C_f(p - 1)$, f)
            add SUB to $E_p$
        $p \leftarrow p + 1$

We use a doubly linked list to store each equivalence
class, which needs $O(n)$ space for a bounded-size alphabet.
The computation of tandem repeats is given as
Algorithm 1, which repeatedly calls the function
Create-Equiv-Class. Algorithm 2 depicts the procedure to
construct all possible $E_p$-classes from a certain $E_{p-1}$-class, and
report those tandem repeats of length p. It is easy to see
that Algorithm 1 takes $O(n^2)$ time for a constant-size
alphabet, since each refinement of $E_p$ from $E_{p-1}$ costs
linear time, and there are $O(n)$ stages in total. The running
time of Algorithm 2 is proportional to the size of the given
$E_{p-1}$-class, since tandem repeats of length p are reported
along with the partitioning of the given $E_{p-1}$-class. Taking
all the $E_{p-1}$-classes into account, stage p requires $O(n)$
time and $O(n)$ extra space. Thus the overall time
complexity of finding all tandem repeats of every possible length
amounts to $O(n^2)$.
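
The quadratic bound follows from a one-line summation. As a sketch, let c be a constant bounding the work per position per stage; c is introduced here for illustration and is assumed to absorb the alphabet size and the fixed threshold $1/k$. With at most $\lfloor n/2 \rfloor$ stages, since $1 \le |f| \le n/2$,

$$
T(n) \;\le\; \sum_{p=1}^{\lfloor n/2 \rfloor} c\,n \;=\; c\,n \left\lfloor \frac{n}{2} \right\rfloor \;=\; O(n^2).
$$
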
Algorithm 2 Identify tandem repeats of length p
Input: An $E_{p-1}$-pair: class $C_f(p - 1)$, a factor f
corresponding to $C_f(p - 1)$
Output: All the $E_p$-pairs derived from the input
Function Create-Equiv-Class($C_f(p - 1)$, f)
    for each $(i, \pi_i(f)) \in C_f(p - 1)$ do
        $l \leftarrow 0$
        for each $\sigma_j \in X[i + p - 1]$ do
            $\pi_i(f_j) \leftarrow \pi_i(f) \times \pi_{i+p-1}(\sigma_j)$, where $f_j = f\sigma_j$
            while $\pi_{i+l}(f_j) \ge 1/k$ do
                add $(i + l, \pi_{i+l}(f_j))$ to $C_{f_j}(p)$
                $l \leftarrow l + 1$
            if $l > 1$ then
                $S \leftarrow S \cup (i, f_j, l)$
    for each j do
        if $|C_{f_j}(p)| = 1$ then
            delete $C_{f_j}(p)$
        else
            add $(C_{f_j}(p), f_j)$ to $E_p$
    return

Theorem 1 The all-tandem-repeats problem can be
solved in $O(n^2)$ time.



Experimental results 

To verify the running time of our algorithm, we
implemented it in C++ for locating
all the tandem repeats in a given weighted
sequence. The experimental environment is an Intel
Core2 Duo CPU P8700 2.5GHz system, with 2GB of
RAM, under the Microsoft Windows XP operating
system (SP2).

In our experiments, the family of SR (serine/arginine
rich) proteins SC35 across species and alleles was
used. We transformed the alignment of the sequences
[23] to a weighted sequence as the input data. Firstly,
we fixed the presence probability threshold to be a
small constant, then tested the performance of
the algorithm with respect to the size of the weighted
sequence, denoted by n. In this case, we set the constant
$1/k = 0.01$. Figure 1 shows the running time
curve of our algorithm with respect to n. It is easily
observed that the algorithm runs in $O(n^2)$ time as
expected.

As stated before, our algorithm is heavily dependent
on the presence probability. We then fixed the size of the
input weighted sequence to be 400, and executed our
algorithm with different presence probabilities. Figure 2
gives the time consumption of the algorithm with
respect to the presence probability $1/k$. Clearly, the
running time grows exponentially as the probability threshold
gets smaller.



Figure 1 Time consumption with respect to n (horizontal axis: size of weighted sequence).

Figure 2 Time consumption with respect to the threshold $1/k$.



Conclusions 

The paper investigated the tandem repeats arising in
weighted sequences. As opposed to the non-weighted
version, the uncertainty of weighted sequences and the
presence probability of every character in the sequence
must be considered. We devised an efficient algorithm for
identifying all the tandem repeats in a weighted sequence,
which operates in $O(n^2)$ time.

Note that if $|\Sigma|$ is sufficiently large, the total number
of repeats might be huge. In the worst case, i.e. when
each character of $\Sigma$ appears at every position of the
weighted sequence, the total number of repeats of a
weighted sequence can be exponential, that is $O(|\Sigma|^n)$.
Considering equivalence classes of positions nevertheless
seems to lead to a quadratic algorithm. If $|\Sigma|$ is
relatively small, and the number of weighted positions in
the weighted sequence is bounded, the algorithm
appears to run in $O(n^2)$ time as expected.